Batch Reinforcement Learning (RL) algorithms attempt to choose a policy from a designer-provided class of policies given a fixed set of training data. Choosing the policy that maximizes an estimate of return often leads to overfitting when only limited data are available, because the policy class is large relative to the amount of data. In this work, we focus on learning policy classes that are appropriately sized for the amount of data available. We accomplish this by applying the principle of Structural Risk Minimization, from Statistical Learning Theory, which uses Rademacher complexity to identify a policy class that maximizes a bound on the return of the best policy in the chosen class, given the available data. Unlike similar batch RL approaches, our bound on return requires only extremely weak assumptions about the true system.
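The central quantity in the approach sketched above is the Rademacher complexity of a policy class. As an illustration only (the paper's actual estimator and bound are not reproduced here), the following is a minimal Monte Carlo sketch of the *empirical* Rademacher complexity for a finite policy class, assuming we already have per-trajectory return estimates for each candidate policy; the function name and data layout are hypothetical.

```python
import random

def empirical_rademacher(returns_by_policy, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite policy class.

    returns_by_policy: list of lists, where returns_by_policy[k][i] is the
    estimated return of policy k on trajectory i (hypothetical layout).
    The estimate averages, over random sign vectors sigma in {-1, +1}^n,
    the supremum over the class of the sigma-weighted empirical mean.
    """
    rng = random.Random(seed)
    n = len(returns_by_policy[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw a Rademacher sign vector: each entry is -1 or +1 with prob 1/2.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Supremum over the (finite) policy class of the signed empirical mean.
        total += max(
            sum(s * r for s, r in zip(sigma, returns)) / n
            for returns in returns_by_policy
        )
    return total / n_draws
```

A larger policy class can only increase this supremum, so the estimate grows with class size; this is the penalty that Structural Risk Minimization trades off against empirical return when selecting an appropriately sized class.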